Abstract:  Reliability  data  and  accurate  compact  models 
for  new  semiconductor  processes  are  often  very  expensive 
to  obtain.  This  paper  considers  low-cost  collection  of 
empirical  data  using  COTS  FPGAs  and  novel  self-test. 
Hardware  experiments  using  a  28  nm  FPGA  demonstrate 
isolation  of  small  sets  of  transistors,  detection  of  subtle 
aging, and measurement precision more than 10x finer
(30-60 femtoseconds) than the state of the art.
Introduction 

Each  semiconductor  process  technology  brings  a  unique  set 
of  reliability  characteristics  and  physical  behaviors  that  are 
difficult  for  the  government  to  accurately  observe  and 
model.  Empirical  reliability  data  from  the  foundry  is 
typically  inaccessible  or  insufficient  for  government 
applications.  Ad  hoc  test  chips  can  be  designed  and 
characterized,  but  at  great  cost.  An  additional  possibility  is 
the  collection  of  extensive  empirical  data  using  COTS 
devices  capable  of  self-test.  We  consider  the  use  of  FPGAs 
and  novel  characterization  methods  for  this  purpose. 
Specifically,  we  study  whether  small  sets  of  transistors  in 
FPGAs  can  be  isolated,  and  whether  high  precision  can  be 
achieved  such  that  subtle  aging  effects  can  be  observed. 
We  find  that  there  are  indeed  cases  in  which  individual 
transistors  can  be  monitored,  and  that  by  taking  careful 
measurements, effects such as NBTI can be monitored
through low-cost, highly parallel self-test, at levels of
precision 10x finer (30-60 femtoseconds) than the state of
the art.

Related  Work 

Most  previous  work  involving  delay  characterization  using 
FPGAs  has  involved  relatively  coarse  precision.  Some  of 
the  highest  levels  of  precision  achieved  have  been  0.61  ps 
[1]  and  3.2  ps  [2].  Aging  in  a  65  nm  FPGA  has  been 
successfully  tracked  [3].  We  explored  femtosecond-scale 
characterization  in  a  hardware  security  context  [4];  that 
work  is  the  basis  for  this  paper.  The  idea  of  creating  a 
reliability  lab-on-a-chip  using  FPGAs  has  been  recently 
considered by Pfiefer et al. [5]; the claimed resolution (not
precision)  is  0.1  ps.  That  approach  characterizes  ring 
oscillators  implemented  in  a  reconfigurable  fabric.  Though 
useful  in  some  contexts,  the  paths  involved  are  dominated 
by  FPGA  interconnect  delays  which  are  not  always  of 
interest or representative. Moreover, delays are measured
over a large number of transistors in the ring rather than a
small number of isolated transistors.

Proposed  Approach 

Identifying  Components  to  Monitor:  A  major  challenge  in 
using  FPGAs  for  low-level  characterization  is  that  the 
circuit  design  details  are  proprietary.  Fortunately, 
information  is  available  about  typical  transistor-level 
structures  of  SRAM-based  lookup  tables  (LUTs),  through 
patents  and  academic  models  such  as  [3].  LUTs  are  a 
particularly  strong  candidate  for  monitoring  aging  since 
their  SRAM  cells  typically  hold  static  data;  certain 
transistors  will  be  statically  biased  and  potentially  prone  to 
bias-dependent  effects.  With  many  semiconductor 
technologies,  transistors  will  degrade  under  certain  gate- 
source  bias,  especially  at  high  temperature;  this  bias 
temperature  instability  (BTI)  effect  shifts  a  transistor’s 
threshold  voltage  and  limits  its  saturation  drain  current. 
PMOS  devices  degrade  under  negative  bias  (NBTI)  and 
NMOS  under  positive  bias  (PBTI).  In  some  cases  partial 
recovery  occurs  when  stress  is  absent. 


Figure 1. Portion of an SRAM-based lookup table circuit.
Individual transistors can be monitored and compared with
high precision. A test case for transistor M1 is shown.


Fig.  1  shows  a  representative  design  of  a  portion  of  LUT 
circuitry,  similar  to  [3].  SRAM  cells  (not  shown)  hold  the 
static  configuration  data  specifying  the  LUT  function.  A 
set  of  inverters  drive  the  SRAM  contents  into  a  pass-gate 
multiplexor  tree;  we  will  refer  to  these  inverters  as  the 
SRAM  drivers,  with  transistors  labeled  M0:M3  in  the 
figure.  The  polarity  of  the  signals  driven  by  the  SRAM 
drivers  is  not  documented;  it  could  be  true  or 
complemented  polarity  with  respect  to  the  SRAM  contents. 
The  SRAM  driver  transistors  are  statically  biased;  we 
hypothesize  that  these  transistors  might  act  as  “canaries” 
for  subtle  degradation  of  the  BTI  type.  Characterizing  their 
switching  speeds  may  not  be  practical,  since  their  gates  are 
controlled  by  the  LUT  contents  which  normally  can  only  be 
changed  during  the  reconfiguration  process.  There  is 
another  possibility — characterizing  their  drive  strengths 
while  they  are  turned  on.  The  drivers  are  responsible  for 
charging  and  discharging  the  nodes  along  the  multiplexor 
tree  (such  as  nodes  A  and  B)  when  the  LSB  of  the  address 
bus (Addr0) switches. The delay through a path of this
type  can  be  measured,  potentially  providing  information 
about  the  drive  strength  of  the  transistor  of  interest.  Note 
that  the  saturation  drain  current  depends  on  the  threshold 
voltage  [6],  approximated  by 

I_{D,sat} \propto (V_{GS} - V_{th})^{2}    (1)

Therefore, a measured increase in delay can be an
indication of degraded current capability and an associated
shift in V_th, for instance caused by NBTI wear out.
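
To make the link between a threshold shift and a delay change concrete, the following Python sketch assumes a power-law current model, I_D,sat proportional to (V_GS - V_th)^alpha, and a path delay inversely proportional to the drive current. Both assumptions, the exponent alpha, and the example voltages are illustrative only and are not taken from the measured device.

```python
def relative_delay_increase(v_gs, v_th, delta_v_th, alpha=2.0):
    """Fractional delay increase for a threshold shift delta_v_th (volts).

    Voltages are magnitudes. Assumes I_D,sat ~ (v_gs - v_th)**alpha and
    delay ~ 1 / I_D,sat. alpha = 2.0 is the classical square-law value
    (an assumption; short-channel devices typically behave differently).
    """
    overdrive_before = v_gs - v_th
    overdrive_after = v_gs - (v_th + delta_v_th)
    if overdrive_before <= 0 or overdrive_after <= 0:
        raise ValueError("transistor must remain on (positive overdrive)")
    current_ratio = (overdrive_after / overdrive_before) ** alpha
    return 1.0 / current_ratio - 1.0


# Hypothetical numbers: 1.0 V gate drive, 0.4 V threshold, 1 mV shift.
shift = relative_delay_increase(v_gs=1.0, v_th=0.4, delta_v_th=0.001)
print(f"relative delay increase: {shift:.4%}")
```

Under these assumed numbers a 1 mV threshold shift changes the delay by roughly a third of a percent, which illustrates why sub-picosecond precision is needed to observe early BTI effects on short paths.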

A  set  of  four  FPGA  configurations  can  be  designed  that 
allows  characterization  of  the  four  types  of  transistors  of 
interest.  Similar  to  known  methods  of  FPGA  delay  fault 
testing  [7],  the  LUT  can  be  configured  with  either  of  two 
alternating  patterns,  and  the  LSB  of  the  address  bus 
(Addr0) connected to the launch signal while the other
address  inputs  are  selectable  per  test.  The  launch  signal  can 
be  driven  with  either  a  rising  or  falling  edge  to  trigger  a 
path.  Subsequently,  a  single  transistor  under  test  will  be 
responsible  for  pulling  downstream  nodes  high  or  low 
through  the  NMOS  pass  gates.  In  the  Fig.  1  example, 
nodes A and B can be charged up through M0 or M1, or
discharged  through  M2  or  M3.  We  refer  to  paths  tested  by 
Addr0 falling (rising) edges as “even” (“odd”) paths, and we refer to paths in
which  the  LUT  output  rises  (falls)  as  “rising”  (“falling”) 
paths.  A  typical  FPGA  LUT  may  have  6  address  inputs 
and  64  SRAM  cells,  and  thus  have  64  buffers  containing 
128  transistors  of  interest. 
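
The four configurations can be enumerated with a short Python sketch. It assumes that the LUT output follows the stored bit (true polarity) and that a path is named after the cell selected once the Addr0 edge has occurred; neither assumption is confirmed by the vendor documentation, and the function and variable names are hypothetical.

```python
from collections import Counter

LUT_INPUTS = 6   # address inputs per LUT (from the text)
NUM_LUTS = 16    # LUTs under test in the experimental setup


def classify(bit_in_even_cells, addr0_edge):
    """Return (parity, direction) for one test configuration.

    bit_in_even_cells: value stored in even-addressed cells; odd cells
        hold the complement, which gives an alternating pattern.
    addr0_edge: "rise" or "fall" on the launch signal.
    Assumed naming: the path belongs to the cell selected after the edge.
    """
    parity = "odd" if addr0_edge == "rise" else "even"
    final_bit = bit_in_even_cells if parity == "even" else 1 - bit_in_even_cells
    direction = "rising" if final_bit == 1 else "falling"
    return parity, direction


counts = Counter()
for bit_in_even_cells in (0, 1):
    for edge in ("rise", "fall"):
        path_type = classify(bit_in_even_cells, edge)
        # The upper address bits select which even/odd cell pair is used.
        pairs_per_lut = 2 ** (LUT_INPUTS - 1)
        counts[path_type] += pairs_per_lut * NUM_LUTS

for path_type, n in sorted(counts.items()):
    print(path_type, n)
```

With 16 six-input LUTs, each of the four path types covers 512 paths, for 2048 in total.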

The paths from Addr0 to the LUT output depend not only
on M0:M3 but also on the varying delays from Addr0 to
the  first  level  of  pass  gates,  and  on  downstream  NMOS 
pass  gates  and  level  restorers  such  as  the  one  between 
nodes  B  and  C.  Delays  through  the  downstream  logic  will 
be  affected  by  dynamic  activity  on  the  address  bus  as  well 
as  data-dependent  wear  out.  To  isolate  component  shifts  in 
M0:M3,  relative  comparisons  are  performed  that  subtract 
out  the  “noise”  of  the  upstream  and  downstream  logic.  The 
measured  parameters  for  each  SRAM  driver  can  be 
compared  against  those  of  the  adjacent  driver  and  others  in 
the  same  neighborhood. 
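
A minimal Python sketch of such a relative comparison follows. It assumes that shared upstream and downstream effects appear as a common shift on neighboring paths, so subtracting the mean neighbor shift leaves the component-specific change. The function name and the delay values are hypothetical.

```python
def relative_shift(baseline, current, index, neighbors):
    """Delay shift of one path minus the mean shift of its neighbors (ps)."""
    own = current[index] - baseline[index]
    reference = sum(current[j] - baseline[j] for j in neighbors) / len(neighbors)
    return own - reference


# Hypothetical delays in ps: every path slows by 0.50 ps (common-mode
# drift), and path 1 slows by an additional 0.10 ps.
baseline = [166.80, 167.10, 166.95, 167.30]
current = [167.30, 167.70, 167.45, 167.80]
extra = relative_shift(baseline, current, index=1, neighbors=[0, 2])
print(f"component-specific shift: {extra:.3f} ps")
```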

Improvements  to  Test  Architecture:  Measurements  of 
component  delays  are  subject  to  both  random  and 
systematic  error.  Random  error  is  the  unpredictable 
deviation in measured values caused by clock jitter, thermal
fluctuations, and many other types of noise. The amount of
error  dictates  the  precision  of  the  measurement  system. 
Random  error  can  often  be  lessened  through  repeated 
measurements,  but  only  partially  and  with  a  linear  increase 
in  test  time. 
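
The trade-off can be sketched in Python under the assumption of independent, identically distributed random error: averaging N measurements reduces the random error by the square root of N while the test time grows linearly in N, and systematic error is not reduced at all. The function names and numbers are illustrative.

```python
import math


def standard_error(sigma_single, repeats):
    """Standard error of the mean of `repeats` independent measurements."""
    return sigma_single / math.sqrt(repeats)


def repeats_needed(sigma_single, target):
    """Smallest repeat count whose standard error is at or below target."""
    return math.ceil((sigma_single / target) ** 2)


# Hypothetical: a 0.2 ps single-measurement sigma and a 0.05 ps target.
n = repeats_needed(0.2, 0.05)
print(n, standard_error(0.2, n))
```

A fourfold improvement in random error therefore costs a sixteenfold increase in test time, which motivates reducing error at its source.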


In  delay  testing,  a  pair  of  test  vectors  that  sensitize  a  path  or 
node  of  interest  is  applied  to  the  circuit  under  test  and  the 
result  is  captured  by  a  sequential  element,  providing  a 
yes/no  answer  per  trial.  Some  methodologies  prescribe 
repeating  the  process  with  different  timing  intervals 
between  the  launch  signal  and  the  capture  signal;  this  is 
alternatively  called  launch-and-capture  or  clock  sweeping. 
With  FPGAs,  testing  has  traditionally  been  performed  on 
relatively  long  paths,  representing  a  full  clock  phase  or 
even  a  full  clock  period.  This  is  appropriate  for 
characterizing  critical  paths  or  detecting  gross  path  delay 
faults,  but  for  characterizing  subtle  aging  effects,  there  is  a 
need to isolate very short paths and detect very fine
(sub-picosecond) phenomena.

For  isolating  short  paths,  we  start  with  on-chip  phase 
shifting.  Using  a  single  source  such  as  an  on-chip  voltage 
controlled  oscillator,  two  clocks  are  generated  with  a  small 
phase  shift.  As  an  example,  Xilinx  Kintex-7  clock  circuitry 
can  generate  phase  shifts  of  roughly  13  ps.  This  enables 
characterization  of  paths  nearly  100x  shorter  than  typical 
(180-degree  launch  and  capture  is  limited  to  paths  longer 
than  1240  ps  in  [2]). 

However,  standard  phase  shifting  is  inadequate  on  its  own, 
since  the  minimum  phase  increment  (e.g.  13  ps)  is  orders  of 
magnitude  larger  than  the  desired  resolution  and  precision. 
For  higher  precision,  we  recommend  the  use  of  arbitrary 
phase  shifting  of  launch  and  capture  signals,  via  fine 
adjustments  to  the  frequency  of  the  reference  clock  feeding 
the above-mentioned clock generator. To perform a sweep,
the on-chip fractional phase shift is set to a fixed value (e.g.
1/56th of the voltage-controlled oscillator, or VCO,
period),  and  the  reference  clock  is  very  finely  adjusted  with 
a  separate  programmable  oscillator.  Existing  oscillators 
(such  as  the  Si570  available  on  many  Xilinx  and  Altera 
development  boards)  have  resolution  better  than  0.09  parts 
per  billion,  readily  supporting  clock  sweeps  with 
femtosecond-scale  step  sizes. 
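
The arithmetic behind such a sweep can be sketched in Python. The sketch assumes that the launch-to-capture interval equals a whole number of 1/56 VCO-period steps and that the VCO frequency scales proportionally with the reference clock; the function names, the step count, and the target delays are hypothetical.

```python
PHASE_DIVISIONS = 56  # fine phase-shift steps per VCO period (from the text)


def interval_ps(f_vco_hz, phase_steps):
    """Launch-to-capture interval (ps) for a fixed fractional phase shift."""
    return phase_steps / (PHASE_DIVISIONS * f_vco_hz) * 1e12


def vco_for_interval(target_ps, phase_steps):
    """VCO frequency (Hz) that yields the requested interval."""
    return phase_steps / (PHASE_DIVISIONS * target_ps * 1e-12)


# Hypothetical sweep: 13 phase steps, 25 fs increments starting at 166.8 ps.
steps = 13
for k in range(3):
    target = 166.8 + 0.025 * k
    f_vco = vco_for_interval(target, steps)
    print(f"{target:.3f} ps -> VCO {f_vco / 1e6:.4f} MHz")
```

In this example each 25 fs step corresponds to a relative frequency change of about 150 parts per million, far coarser than the 0.09 parts per billion resolution of the oscillator, so the oscillator is not the limiting factor for the step size.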


The  proposed  test  architecture  is  illustrated  in  Fig.  2. 
Further  details  regarding  minimizing  random  and 
systematic  error  can  be  found  in  Zick  et  al.  [4]. 
Figure  2.  Overview  of  proposed  test  architecture 


Data  Analysis  Considerations:  The  launch-and-capture 
characterization  methodology  provides  a  set  of  signal 
probabilities  (SP)  versus  timing.  Typically,  a  simple  rule  is 
employed  to  convert  the  signal  probability  data  to  a  path 
delay  estimate.  For  instance,  it  is  common  to  use  the  clock 
period  at  which  SP  is  closest  to  0.5  as  the  estimated  delay 
[1].  Another  rule  is  to  use  the  lowest  clock  frequency  at 
which  the  path  fails  at  least  once  [8].  We  refer  to  these 
rules  as  Nearest-to-Midpoint  and  First  Fail,  respectively. 
While  these  simple  single-point  rules  have  the  advantage  of 
being  lightweight,  they  are  prone  to  significant  random 
error  in  the  selected  point,  and  are  not  suitable  for  the 
highest-precision  characterization.  We  recommend  using 
more  of  the  data  points  in  order  to  improve  the  precision. 
Assuming  normally-distributed  noise,  the  SP  data  is 
essentially  the  integral  of  a  normal  distribution.  It  is  known 
that  such  a  scenario  can  be  modeled  as  a  sigmoid  function. 
Specifically,  SP  data  can  be  fitted  to  a  curve 

SP_{fit} = \frac{1}{1 + e^{-A(x - B)}}    (2)

where  x  is  the  launch-to-capture  interval,  A  is  the  slope,  and 
B  is  the  mid-point  of  the  curve.  The  path  delay  can  then  be 
estimated  from  the  curve  simply  by  using  the  value  of  B 
(when x = B, SP_fit = 0.5). An example of a set of signal
probabilities  and  the  associated  curve  are  shown  in  Fig.  3. 
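
The fitting step can be illustrated with a self-contained Python sketch. It is not the MATLAB procedure used in the experiments: it substitutes a hand-written damped Gauss-Newton least-squares fit, and the data are synthetic and noise-free. The Nearest-to-Midpoint rule is included for comparison; all function names are hypothetical.

```python
import math


def sigmoid(x, a, b):
    """SP_fit(x) = 1 / (1 + exp(-a * (x - b)))."""
    z = -a * (x - b)
    if z > 700.0:  # avoid overflow in exp for extreme arguments
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def nearest_to_midpoint(xs, sps):
    """Single-point rule: the interval whose SP is closest to 0.5."""
    return min(zip(xs, sps), key=lambda p: abs(p[1] - 0.5))[0]


def sum_sq_error(xs, sps, a, b):
    return sum((sp - sigmoid(x, a, b)) ** 2 for x, sp in zip(xs, sps))


def fit_sigmoid(xs, sps, iterations=100):
    """Least-squares estimate of (a, b) using damped Gauss-Newton steps."""
    b = nearest_to_midpoint(xs, sps)               # starting midpoint
    sign = 1.0 if sps[-1] >= sps[0] else -1.0      # rising or falling curve
    a = sign * 8.0 / (max(xs) - min(xs))           # rough starting slope
    damping = 1e-3
    err = sum_sq_error(xs, sps, a, b)
    for _ in range(iterations):
        jaa = jab = jbb = ga = gb = 0.0
        for x, sp in zip(xs, sps):
            s = sigmoid(x, a, b)
            w = s * (1.0 - s)
            da, db = w * (x - b), -a * w           # partial derivatives
            r = sp - s                             # residual
            jaa += da * da
            jab += da * db
            jbb += db * db
            ga += da * r
            gb += db * r
        m_aa, m_bb = jaa * (1.0 + damping), jbb * (1.0 + damping)
        det = m_aa * m_bb - jab * jab
        if det == 0.0:
            break
        step_a = (ga * m_bb - gb * jab) / det
        step_b = (gb * m_aa - ga * jab) / det
        new_err = sum_sq_error(xs, sps, a + step_a, b + step_b)
        if new_err < err:                          # accept, relax damping
            a, b, err = a + step_a, b + step_b, new_err
            damping = max(damping / 10.0, 1e-12)
        else:                                      # reject, damp harder
            damping *= 10.0
    return a, b


# Synthetic, noise-free signal probabilities (hypothetical values).
xs = [160.0 + 0.25 * i for i in range(57)]
sps = [sigmoid(x, 0.5, 166.8) for x in xs]
a_fit, b_fit = fit_sigmoid(xs, sps)
print(f"Sigmoid Fit delay: {b_fit:.3f} ps")
print(f"Nearest-to-Midpoint delay: {nearest_to_midpoint(xs, sps):.3f} ps")
```

Because the fit uses every sample, its estimate is not limited to the sweep grid, whereas the single-point rule can only return one of the sampled intervals.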


Figure 3. Estimating path timing more precisely by fitting signal
probability data to a curve and finding the 50% point (horizontal
axis: delay in ps). In this case, the path delay is 166.800 ps;
R^2 = 0.9987.

Experimental  Results 

Experimental  Setup:  The  base  characterization  platform  is  a 
Xilinx  Kintex-7  FPGA  manufactured  in  the  TSMC  28HPL 
high-K  metal  gate  process,  mounted  on  a  KC705 
development  board.  The  components  under  test  are  2048 
transistors  associated  with  1024  LUT  SRAM  cells  in  the 
style  of  Fig.  1.  The  transistors  reside  in  16  6-input  LUTs 
located  in  slices  X36Y179  to  X39Y179,  close  to  the  on- 
chip  XADC  sensors  which  monitor  voltage  and 
temperature.  Sixteen  components  are  tested  in  parallel  (one 
in  each  LUT).  Each  of  the  four  types  of  transistors  is  tested 
with  a  unique  FPGA  configuration.  The  system  design 
includes  instrumentation,  a  MicroBlaze  CPU,  and  a  control 
program,  all  developed  with  Xilinx  tools  version  13.4. 
Curve fitting is performed using Eq. 2 and the nlinfit
function in MATLAB.

Overhead:  The  time  for  each  16-LUT  trial  is  measured  to 
be  roughly  1.5  ms,  largely  dictated  by  memory  accesses 
between  the  MicroBlaze  and  the  peripheral  containing  the 
launch and capture instruments. Additional overheads are
associated with changing the frequencies of the external
oscillator  (12  ms  lock  time)  and  on-chip  clock  generator, 
and  the  printing  of  data  over  a  UART  running  at  460800 
baud.  In  total,  when  collecting  signal  probability  data  over 
a  range  of  85  ps  with  0.025  ps  resolution,  with  1  sample  of 
size  50  at  each  timing  point,  the  time  is  5  minutes  for  each 
set  of  512  paths. 
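
The reported figures can be combined into a rough time budget, sketched below in Python. Two assumptions are made that the text does not state: that one 1.5 ms trial already covers the 50 samples taken at a timing point, and that the external oscillator is retuned once per timing point. The variable names are hypothetical.

```python
SPAN_PS = 85.0     # swept launch-to-capture range (from the text)
STEP_PS = 0.025    # sweep resolution (from the text)
PATHS = 512        # paths per configuration (from the text)
PARALLEL = 16      # components tested in parallel (from the text)
TRIAL_S = 1.5e-3   # time per 16-LUT trial (from the text)
RETUNE_S = 12e-3   # external oscillator lock time (from the text)

timing_points = round(SPAN_PS / STEP_PS)  # sweep points per path
groups = PATHS // PARALLEL                # groups of 16 paths

# Assumption: each trial includes all 50 samples at one timing point.
trial_time = timing_points * groups * TRIAL_S
# Assumption: the oscillator is retuned once per timing point.
retune_time = timing_points * RETUNE_S

print(f"timing points: {timing_points}, groups: {groups}")
print(f"trials: {trial_time:.1f} s, retuning: {retune_time:.1f} s")
```

Under these assumptions, trials and retuning account for roughly 204 s of the reported 5 minutes; the remainder would be attributable to on-chip clock generator changes and UART output.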

Precision:  To  quantify  the  precision  of  the  measurement 
approach,  self-test  was  conducted  multiple  times 
consecutively.  For  each  of  the  four  types  of  paths,  the  test 
configuration  was  loaded,  the  set  of  512  paths  was  tested 
over  a  span  of  5  minutes,  and  the  testing  was  repeated  for  a 
total  of  3  measurements  over  15  minutes.  Ambient 
temperature  was  controlled  to  25 °C  using  a  thermal 
chamber.  The  standard  deviation  of  measurements  of  the 
same  path  turns  out  to  be  extremely  small  as  shown  in 
Table  1,  averaging  0.063  ps  for  all  2048  paths.  This  level 
of  repeatability  is  10x  finer  than  the  measurements  of  a 
single path in [1], and roughly 50x finer than the
repeatability  of  the  estimates  in  [2].  When  interleaving 
multiple  measurements,  the  standard  deviation  is  even 
lower,  reaching  0.031  ps. 
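
The repeatability metric can be sketched in Python as follows. The sketch assumes the sample standard deviation is used (the text does not say whether the sample or population form was applied), and the per-path delay values are hypothetical. The final lines cross-check the reported overall average against the four per-type averages in Table 1, which is valid because each path type contains the same number of paths.

```python
import statistics


def repeatability(measurements):
    """Return (average sigma, worst-case sigma) in ps over all paths.

    measurements maps a path name to its repeated delay estimates in ps.
    """
    sigmas = [statistics.stdev(runs) for runs in measurements.values()]
    return statistics.mean(sigmas), max(sigmas)


# Hypothetical data: three consecutive measurements for three paths.
runs = {
    "path0": [166.80, 166.85, 166.75],
    "path1": [170.10, 170.12, 170.08],
    "path2": [158.40, 158.40, 158.40],
}
avg_sigma, worst_sigma = repeatability(runs)
print(f"average sigma {avg_sigma:.4f} ps, worst case {worst_sigma:.4f} ps")

# Cross-check using the per-type averages reported in Table 1.
table1_averages = [0.050, 0.094, 0.050, 0.058]
overall = statistics.mean(table1_averages)
print(f"overall average sigma {overall:.3f} ps")
```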

The  fine  degrees  of  precision  achieved  are  due  to  a 
combination  of  test  architecture,  data  analysis,  and 
particulars  of  the  implementation  such  as  clock  generator 
jitter  performance.  For  the  data  analysis  in  particular,  we 
can  quantify  the  impact  of  the  various  strategies  for 
mapping signal probabilities to path delay. We generated
separate path delay estimates from the same signal
probability data using the three strategies described under
Data Analysis Considerations: the conventional
Nearest-to-Midpoint rule, the First Fail rule, and our
recommended Sigmoid Fit. As shown in Table 2, Sigmoid Fit
substantially improves precision, with an average standard
deviation 3.3x lower than Nearest-to-Midpoint. The First
Fail rule was found to be highly susceptible to outliers, as
expected; Sigmoid Fit is 8.2x better in the worst case
(9.7x better on average).

Table 1. Standard Deviation, Consecutive Measurements

Path type       Avg. σ, all paths (ps)    Worst-case σ (ps)
Even/rising     0.050                     0.151
Even/falling    0.094                     0.234
Odd/rising      0.050                     0.176
Odd/falling     0.058                     0.169

Table 2. Improvement in Measurement Precision through
Curve Fitting (Sigmoid Fit)

Strategy for mapping      Avg. σ, all 2048    Worst-case σ (ps)
data to delay estimate    paths (ps)
First Fail                0.613               1.922
Nearest-to-Midpoint       0.205               1.097
Sigmoid Fit               0.063               0.234

Detection  of  Component  Delay  Changes:  To  test  whether 
subtle  component  changes  could  be  detected,  the  2048 
paths  under  test  were  statically  stressed  over  a  period  of 
922  hours  with  a  random  1024-bit  pattern  in  the  associated 
SRAM  cells,  with  occasional  breaks  for  characterization. 
The  chip  junction  temperature  Tj  was  controlled  using  a 
reconfigurable  heater  circuit  to  accelerate  aging. 
Temperatures  were  measured  using  the  on-chip  XADC 
sensor.  The  temperature  schedule  was  319  K  for  82  hours; 
358 K for 198 hours; ~388 K for 642 hours. The wear out
acceleration factors at these temperatures relative to room
temperature (298 K) are 1.5-3.9 under the assumptions of [3].
The launch signal was connected to the Addr0 input of
each  LUT  and  allowed  to  switch  continuously  during 
stress.  After  the  922  hours  of  stress,  the  device  was  left 
powered  off  for  several  weeks  except  for  occasional 
characterization lasting ~1 hour. All measurements were
performed with Tj near 30°C.
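
The acceleration factors are quoted under the assumptions of [3], which are not restated here. As an illustration only, the Python sketch below uses a generic Arrhenius model, which is not necessarily the model of [3]. The activation energy of 0.15 eV is a hypothetical value chosen solely because it roughly reproduces the quoted range.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5


def acceleration_factor(temp_k, ref_k=298.0, activation_ev=0.15):
    """Arrhenius acceleration of wear out at temp_k relative to ref_k.

    activation_ev is a hypothetical value, not a parameter from [3].
    """
    exponent = activation_ev / BOLTZMANN_EV_PER_K * (1.0 / ref_k - 1.0 / temp_k)
    return math.exp(exponent)


# Temperature schedule from the text: (junction temperature in K, hours).
schedule = [(319.0, 82.0), (358.0, 198.0), (388.0, 642.0)]
for temp_k, hours in schedule:
    af = acceleration_factor(temp_k)
    print(f"{temp_k:.0f} K ({temp_k - 273.15:.0f} C): AF = {af:.2f}, "
          f"{hours:.0f} h of stress ~ {af * hours:.0f} h at 298 K")
```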


We  found  that  evidence  of  sub-picosecond,  differential 
wear  out  is  measurable,  even  after  just  days  of  stress. 
During  stress,  rising  paths  that  had  been  biased  at  1  incur 
an  increasing  slowdown  relative  to  those  biased  at  0. 
Falling  paths  exhibit  the  opposite.  Delay  trends  are  plotted 
in Fig. 4. In just 8.25 days of operation at 85°C (between
hours  82  and  280),  a  clear  separation  emerged  for  all  four 
path  types.  For  instance,  at  hour  280,  the  271  even/rising 
paths  biased  at  1  had  slowed  by  0.1  ps  more  than  the  241 
even/rising  paths  biased  at  0.  Once  the  stress  was 
completed  and  the  device  left  powered  off,  the  delays 
largely  stabilized,  validating  that  the  degradation  was 
caused  by  the  stress  and  was  semi-permanent. 
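
The differential comparison between bias groups can be sketched in Python. The function name and the delay values are hypothetical; the sketch only shows how a group-level separation such as the 0.1 ps figure could be computed from per-path delays measured before and after stress.

```python
from statistics import mean


def differential_slowdown(baseline, current, bias):
    """Mean delay shift of bias-1 paths minus that of bias-0 paths (ps)."""
    shifts = [after - before for before, after in zip(baseline, current)]
    ones = [s for s, bit in zip(shifts, bias) if bit == 1]
    zeros = [s for s, bit in zip(shifts, bias) if bit == 0]
    return mean(ones) - mean(zeros)


# Hypothetical rising-path delays (ps) before and after stress.
baseline = [166.80, 167.10, 166.95, 167.30]
current = [167.00, 167.20, 167.15, 167.40]
bias = [1, 0, 1, 0]  # static SRAM bit held during stress
separation = differential_slowdown(baseline, current, bias)
print(f"bias-1 minus bias-0 slowdown: {separation:.3f} ps")
```

Averaging over hundreds of paths per group reduces the influence of random measurement error on the group difference, which helps a 0.1 ps separation stand out.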
Figure 4. Subtle differences are detected over time
between bias-1 paths and bias-0 paths. Sub-picosecond
effects are seen in all four path types.


Conclusion 

This  research  contributes  methods  for  characterizing 
CMOS  circuitry  at  low  cost  with  astonishingly  fine 
precision  using  COTS  FPGAs  and  novel  self-test.  The 
approach  can  be  extended  to  measuring  thousands  of  paths 
in  parallel.  Whereas  much  previous  work  involves  time 
scales  of  tens  to  hundreds  of  picoseconds,  the  enhanced 
methods reach tens of femtoseconds. High precision
requires special design and analysis methods; as one
example, our curve fitting provided a 3.3x improvement,
independent of other factors. Improving precision by
10-100x opens up new possibilities for gaining empirical
insight  into  the  reliability  of  individual  ICs  and  a  particular 
semiconductor  technology. 